Applying Machine Learning to Catalogue Matching in
Astrophysics 
ABSTRACT

We present the results of applying automated machine learning techniques to the 
problem of matching different object catalogues in astrophysics. In this study we take 
two partially matched catalogues where one of the two catalogues has a large positional 
uncertainty. The two catalogues we used here were taken from the HI Parkes All Sky 
Survey (HIPASS) and the SuperCOSMOS optical survey. Previous work had matched
44% (1887 objects) of HIPASS to the SuperCOSMOS catalogue. 

A supervised learning algorithm was then applied to construct a model of the 
matched portion of our catalogue. Validation of the model shows that we achieved a 
good classification performance (99.12% correct). 

Applying this model to the unmatched portion of the catalogue found 1209 new
matches. This increases the number of matched objects from 1887 to 3096. The
combination of these procedures yields a catalogue that is 72% matched.

Key words: catalogues, astronomical data bases: miscellaneous



1 INTRODUCTION 

The Virtual Observatory will bring new opportunities and 
new challenges. Our study works with a problem that may 
become typical in the virtual observatory context: the prob- 
lem of matching catalogues with significant positional un- 
certainties. 

The Virtual Observatory will allow efficient access to 
the vast amounts of data being collected by all sky surveys in 
many wavelengths. A fundamental operation for increasing 
the utility of this data will be the matching of catalogues. 
Matching catalogues will utilise many components of the 
virtual observatory. The main task of these services will be 
to perform a fuzzy (probabilistic) distributed spatial join. 
Distributed computing is required so that catalogues can 
be published at appropriate sites all over the world. Special
indexes have also been developed to aid in doing fast
spatial joins; at present Open Sky Query (Budavari et al.
2004) is leading progress toward making this a reality. The
study reported in this paper is focused on the fuzzy or
probabilistic component of this problem. That is, for a given
source, how is the correct counterpart chosen out of a number
of candidate matches within the error ellipse? Supervised
learning techniques have already been applied to the
astronomy problems of star-galaxy classification (Bertin &
Arnouts 1996; Andreon et al. 2000), galaxy morphology
classification (Bazell & Aha 2001) and the search for quasars
in photometric data (Richards et al. 2001). A review of
astronomical applications of machine learning can be found
in Tagliaferri et al. (2003). Both within astronomy and in
other applications the focus of supervised learning tech- 
niques is on regression or pattern classification. The specific 
type of pattern classification problem (the matching prob- 
lem) which we consider here is reasonably novel and the 
authors believe warrants further attention. A related, but
underdeveloped, field in computer science is the problem of
record linkage (Fellegi & Sunter 1969). The solution to the
problem of matching catalogues is likely to have an impact 
on record linkage, which demonstrates just one way that 
the development of the virtual observatory may impact on 
fields outside astronomy. Borrowing from computer science 
this paper uses the term linkage to refer to the problem of 
resolving the ambiguity in the matching problem. We draw 
the distinction between this problem and the computational 
and network problems associated with matching catalogues. 
The database term of joining suggests itself as being appro- 
priate for describing catalogue matching problems focused 
on the computational or distributed nature of the problem. 
In this paper we focus on the problem of linkage. 

A number of simple approaches to linkage are commonly
used in astronomy. Often taking the closest match
(in terms of position only) is considered adequate, especially
when the positional uncertainties are small, for example
Drinkwater et al. (1997) (positional uncertainties are of the
order of arcseconds). Another, more sophisticated technique,
the likelihood ratio, compares the probability that the
object is a match with the probability that it is
a chance background object (Sutherland & Saunders 1992).
The likelihood ratio itself only utilises a small number of
parameters (and only those from the more dense catalogue).

Our work uses supervised learning techniques from ma- 
chine learning in order to link these two catalogues using 
all available information. Our overall goal is to provide a 
proof of concept that the full parameter list (or an intelli- 
gently chosen subset) contains useful information that can 
be used to reliably link catalogues. While a method of
performing linkage using a simple supervised learning
algorithm (decision trees) has been previously demonstrated by
Voisin & Donas (2001), no follow-up work on the topic has
been published. This study offers a more complete treatment
of the problem in a number of ways. External information is
used to construct the training set; Voisin and Donas simply
used a cut on proximity to assign labels. We also analyse the
scientific implications of different matching algorithms and
investigate different and arguably more powerful algorithms.

The linkage method we propose is well suited to a cer- 
tain class of problem. First of all there must be a significant 
linkage problem; the positional uncertainty of one catalogue 
must be large enough that there are frequently multiple can- 
didate links in the more dense catalogue. This method also 
requires that there is a minimum and maximum amount of 
information available. There must be a significant subset of 
the catalogues that is already linked; this is vital for us to 
pursue a supervised learning procedure. It is also important 
that a significant subset of the catalogue remains unlinked 
in order for the procedure to cause a significant increase in 
the catalogue size. 

The problem we discuss here involves joining a catalogue
with comparatively poor positional uncertainties (HICAT;
Meyer et al. 2004) to a catalogue with good positional
uncertainties (SuperCOSMOS; Hambly et al. 2001).
In general the positional resolution of the survey affects both 
the positional uncertainties and the density of sources per 
unit area of sky. In this study the SuperCOSMOS catalogue 
is the more dense catalogue and HICAT is the sparse
catalogue. Stated generally, our problem is to choose, for
each object in the sparse catalogue, the correct counterpart
from the dense catalogue. While it is not guaranteed that
there is a single link in the dense catalogue, we deal only
with the cases where we assume this to be true.
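
The candidate search implied by this problem statement can be
sketched as follows. This is a minimal illustration rather than
the procedure used in the study: the function names, the tuple
layout (RA, Dec in degrees) and the search radius are all
assumptions made for the example.

```python
import math

def angular_separation_arcmin(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcminutes between two positions in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine form, numerically stable for small separations.
    a = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 60.0

def candidates_within(sparse_obj, dense_catalogue, radius_arcmin):
    """Return (separation, dense_obj) pairs inside the search radius, nearest first."""
    ra, dec = sparse_obj
    found = []
    for obj in dense_catalogue:
        sep = angular_separation_arcmin(ra, dec, obj[0], obj[1])
        if sep <= radius_arcmin:
            found.append((sep, obj))
    return sorted(found)

# Hypothetical positions (RA, Dec) in degrees; the 5 arcminute radius is illustrative.
radio_source = (150.00, -30.00)
optical = [(150.01, -30.00), (150.05, -30.02), (151.00, -30.00)]
matches = candidates_within(radio_source, optical, radius_arcmin=5.0)
```

Each sparse-catalogue object then carries a list of dense-catalogue
candidates, and the linkage problem is to decide which of them, if
any, is the true counterpart.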

In this paper, we extend the work previously presented
in Rohde et al. (2004) by considering the output (new
matches) of the matching procedure that we have developed.
This work is also applied to the final version of the HOPCAT
catalogue.

This paper has the following structure. Section 2 dis- 
cusses the problem domain that we are investigating (in 
particular the catalogues involved). Section 3 discusses the 
construction and validation of the model. Section 4 discusses 
how we apply the model to the unmatched portion of the 
HIPASS Optical Catalogue (HOPCAT) in order to match a 
further 1209 objects. Section 5 concludes by making some 
overall comments about our results. 



2 PROBLEM DOMAIN AND CATALOGUE 
DETAILS 

2.1 HICAT 

The HI Parkes All Sky Survey (HIPASS) is a survey of the 
entire southern sky for HI. The HIPASS catalogue (HICAT;
Meyer et al. 2004) was produced by signal processing
software run over the HIPASS data cubes. The resulting
catalogue contains 4315 HI sources with accurate redshifts and
significant positional uncertainties (RA has $\sigma = 0.78$
arcminutes; Zwaan et al. 2003). HICAT describes each source
using many parameters; the most important of these are
velocity, peak flux ($S_p$), integrated flux
($S_{\mathrm{int}}$) and velocity width.



2.2 SuperCOSMOS 

SuperCOSMOS is a survey of the entire southern sky on 
photographic plates taken by the UK Schmidt Telescope. 
This is imaging data and as such has accurate positions but 
no redshift. 

A catalogue has been produced of the SuperCOSMOS
images; the image processing used to extract this catalogue
is described in Hambly et al. (2001). For
this application it was decided that it was best to reprocess
the images using the SExtractor package (Bertin & Arnouts
1996) to obtain better segmentation. The SuperCOSMOS
parameters are area, semi-major axis, semi-minor axis, Bj
(mag), R (mag) and I (mag). This catalogue contained a
large number of stars, which obviously were non-matching;
for this reason it was decided to also provide a star-galaxy
classification. SExtractor can only provide star-galaxy
separation using its built-in neural network when the images
are from a CCD rather than a photographic plate. For this
reason the following two-step procedure was used to obtain
classes. Diffraction spikes were observed as an obvious
feature to assist in star-galaxy classification. Software was
written using the cfitsio library which analysed the images
and measured the length of the spikes of all objects. A
training set was then constructed of 1000 galaxies and 1000
stars, and a support vector machine (see Section 3.3) was
trained to classify these objects using all of the previously
mentioned SuperCOSMOS features as well as the diffraction
spike feature. The use of machine learning techniques has
been commonplace for the problem of star-galaxy
classification for some time (Tagliaferri et al. 2003; Bertin
& Arnouts 1996). Using a cross-validation methodology, in
which the algorithm is tested on data that it was not trained
on, the star-galaxy classifier showed a performance of 88%.



2.3 HOPCAT 

A complementary study by Doyle et al. (2004) produced the
HOPCAT catalogue, which matched 1887 of the 4315 HICAT
sources. The procedure for matching involved joining
the optical candidates to redshift observations taken from
the Six Degree Field survey (6dF; Wakamatsu et al. 2003)
and the NASA Extragalactic Database (NED). If all of the
optical candidates had redshift information and if there was
exactly one object matching the HICAT redshift then it was
deemed to be a match. See Doyle et al. (2004) for
more details.

This procedure allowed the matching of many HICAT 
sources to optical counterparts. It is however a slow pro- 
cedure requiring heavy human intervention and it also was 
inconclusive in cases where there was no additional redshift 
information from 6dF or NED. 



3 CONSTRUCTION AND VALIDATION OF 
MODEL 

While supervised learning algorithms automate much of the 
model construction process, human judgment must be used 
at a number of steps. The input variables for the algorithm
must be chosen; this procedure is known as feature
selection. There is no correct procedure for doing this,
except to call upon human judgment. Learning algorithm
performance is generally improved by the choice of a small 
but informative set of features. 

Closely related to feature selection is the preprocessing 
of input variables. For example, is it advantageous to provide
raw magnitudes or colour index information? If the distri- 
bution of a variable is not uniform it may be advantageous 
to transform it prior to learning. 

In this study a number of algorithms will be attempted 
and a procedure known as 10-fold cross validation is used 
to estimate the generalisation performance of these models 
(see Section 3.3). The best of these models is selected and 
performance is reported. 



3.1 Feature Selection 

In order for an optical parameter to be useful it must con- 
vey some useful information, either a relation between the 
optical parameter and radio parameters or something that 
will identify that the object is a galaxy likely to be a strong 
HI source. In contrast a radio parameter is only useful if it 
can be used to identify a relation between radio and optical 
parameters. The asymmetry above is due to the fact that 
there are many optical candidate matches for a single radio 
object. 

In machine learning the parameters that are selected 
to build a model are called features. The rationale for our 
choice of features is as follows: log area and Bj (mag) should 
be roughly correlated with log peak flux ($S_p$) and log
integrated flux ($S_{\mathrm{int}}$). Velocity is also a measure of distance,
so an inversely proportional relationship would be expected 
between velocity and either area or magnitude. We would 
expect highly elliptical optical objects to link with radio 
objects with high velocity widths. It was unclear if galaxy 
colour would contribute to the classifier, although it may 
be a means of detecting late-type galaxies that are likely to 
contain significant amounts of HI. 

The only parameter not mentioned is separation. This is
obviously useful, as we would expect objects with low
separation to be more likely to be matches.

The logarithm was taken of integrated flux, peak flux 
and area so that these would all roughly correlate with mag- 
nitude. A list of all the features selected as machine learning 
inputs is given in Table 1. 



Feature  Origin         Name
1        Radio-Optical  Separation
2        Radio          Velocity
3        Radio          Velocity Width
4        Radio          Log Integrated Flux ($S_{\mathrm{int}}$)
5        Radio          Log Peak Flux ($S_p$)
6        Optical        Log Isophotal Area
7        Optical        Semi-major Axis
8        Optical        Semi-minor Axis
9        Optical        Bj (Magnitude)
10       Optical        Bj - R
11       Optical        Bj - I
12       Optical        Star-Galaxy classification

Table 1. Selected features for machine learning inputs
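
The twelve inputs of Table 1 can be assembled into a single
vector along the following lines. This is only a sketch: the
dictionary keys, the example values and the use of base-10
logarithms are assumptions (the text does not state the base),
and the star-galaxy class is taken to be encoded as 1 for a
galaxy.

```python
import math

def build_features(separation, radio, optical):
    """Assemble the 12 inputs of Table 1 for one radio-optical pair."""
    return [
        separation,                     # 1  radio-optical separation
        radio["velocity"],              # 2  velocity
        radio["velocity_width"],        # 3  velocity width
        math.log10(radio["s_int"]),     # 4  log integrated flux
        math.log10(radio["s_peak"]),    # 5  log peak flux
        math.log10(optical["area"]),    # 6  log isophotal area
        optical["semi_major"],          # 7  semi-major axis
        optical["semi_minor"],          # 8  semi-minor axis
        optical["bj"],                  # 9  Bj magnitude
        optical["bj"] - optical["r"],   # 10 colour Bj - R
        optical["bj"] - optical["i"],   # 11 colour Bj - I
        optical["is_galaxy"],           # 12 star-galaxy class (assumed 1 = galaxy)
    ]

# Hypothetical values, for illustration only.
radio = {"velocity": 1500.0, "velocity_width": 120.0, "s_int": 10.0, "s_peak": 0.1}
optical = {"area": 1000.0, "semi_major": 20.0, "semi_minor": 10.0,
           "bj": 15.0, "r": 14.0, "i": 13.5, "is_galaxy": 1}
features = build_features(1.2, radio, optical)
```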



3.2 Framing the Matching Problem as a Pattern 
Classification Problem 

The matching problem is not framed automatically as a pat- 
tern classification problem. In order to make it one we com- 
bine inputs of radio and optical objects into a single vector. 
If the pair of objects are matching then the vector gets a pos- 
itive label, otherwise the pair is given a negative label. The 
negative training points are determined by taking all the 
non-matching objects from the dense catalogue and pairing 
them with the respective object in the sparse catalogue. We 
also employ a mismatched set of negative examples which is 
discussed later. 
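
The labelling scheme just described can be illustrated with a
short sketch. The identifiers and the function name are
hypothetical; in practice each triple would carry the combined
radio and optical feature vector rather than a pair of names.

```python
def label_pairs(radio_id, candidate_ids, matched_id):
    """Return (radio_id, optical_id, label) triples: +1 for the known match, -1 otherwise."""
    return [(radio_id, opt, 1 if opt == matched_id else -1) for opt in candidate_ids]

# Hypothetical identifiers: one sparse-catalogue source with three dense-catalogue
# candidates, of which "opt_b" is the counterpart recorded in the matched catalogue.
pairs = label_pairs("hi_001", ["opt_a", "opt_b", "opt_c"], matched_id="opt_b")
positives = [p for p in pairs if p[2] == 1]
negatives = [p for p in pairs if p[2] == -1]
```

Every matched source therefore contributes exactly one positive
vector and one negative vector per remaining candidate, which is
why the negative data outnumber the positive data.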

It is normal to report the error on both the negative and 
the positive parts of the training set separately. This is par- 
ticularly helpful in situations where the amount of positive 
and negative training data is unbalanced (we have 6.3
negative examples for every positive). If a classifier were to
always give a negative response it would trivially give a
classification rate of 86% over all examples: 0% on the positive
data and 100% on the negative. For this reason the performance
on the positive and negative data is reported separately. In 
order to avoid the inclusion of massive numbers of small 
and faint objects, only objects with an area greater than 
600 pixels were included in this study. 
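
The 86% figure for an always-negative classifier follows
directly from the class ratio quoted above, as this short check
shows (the variable names are ours).

```python
neg_per_pos = 6.3  # negative examples per positive example, as quoted in the text

# An always-negative classifier gets every negative right and every positive
# wrong, so its overall success rate is simply the fraction of negatives.
always_negative_accuracy = neg_per_pos / (1.0 + neg_per_pos)
print(f"{100 * always_negative_accuracy:.1f}%")  # prints 86.3%
```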

Here we report success rates rather than error rates. Er- 
ror rates over the negative data are known as false positives 
and error rates over the positive data are known as false 
negatives. False positives and false negatives are related to 
traditional measures of completeness and efficiency. Both 
completeness and false negatives refer to the objects that 
are lost from the sample due to misclassification. Likewise 
efficiency and false positives refer to the incorrect objects 
that are found in our sample. 

In this situation the relationships between completeness 
and false negatives, and efficiency and false positives are 
complicated by the framing of the problem in terms of binary 
pattern classification. In this situation the classifier is not 
constrained to give exactly one match; the 'combinatorial' 
nature of the output causes there to be no direct relationship 
between completeness and false negatives and efficiency and 
false positives. 
3.3 Model Selection

There are a number of supervised learning algorithms that 
are appropriate to apply to this problem. One is the Support 
Vector Machine. The Support Vector Machine (SVM)
computes a nonlinear mapping that transforms its input data
into a high dimensional feature space where patterns of
different classes can be separated by a hyperplane (Vapnik
1995). The software being used is SVM Light (Joachims
1998); this software is free for scientific use. SVMs have a
number of parameters that can be tuned for optimal
performance, including the kernel function. Kernel functions
map the data to a high dimensional feature space. The SVM
searches for a function that is linear in this high dimensional
space, but non-linear in input space, to separate the two
classes. Popular kernels include linear, polynomial and
radial basis functions (RBF) (Schölkopf & Smola 2002).
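
For concreteness, the three kernel families named above can be
written in their textbook forms as below. This is a sketch for
illustration: the offset `coef0` and the default parameter values
are assumptions, and the exact parameterisation used by any
particular SVM package may differ.

```python
import math

def dot(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def linear_kernel(x, y):
    return dot(x, y)

def polynomial_kernel(x, y, degree=3, coef0=1.0):
    # (x . y + coef0) ** degree; coef0 = 1 is an assumed convention.
    return (dot(x, y) + coef0) ** degree

def rbf_kernel(x, y, gamma=1.0):
    # exp(-gamma * |x - y|^2); equals 1 when x == y and decays with distance.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [0.5, -1.0]
values = (linear_kernel(x, y), polynomial_kernel(x, y), rbf_kernel(x, y))
```

Each kernel returns the inner product of the two inputs in some
implicit feature space, so the SVM never needs to construct the
high dimensional mapping explicitly.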

SVMs also allow the soft margin (Cristianini &
Shawe-Taylor 2000) to be adjusted. This parameter controls
the trade-off between smooth and overly complex functions.
Controlling this trade-off is necessary to obtain good
generalisation. Functions that represent the training data
well but do not generalise to novel examples are said to
have overfit the data in machine learning terminology. The
soft margin is a tool for the SVM to avoid overfitting.

Another popular and older algorithm is the neural
network (Bishop 1995). Neural networks are functions with a
network-like topology and many free parameters. A gradient
descent optimisation algorithm is used to partially search
the parameter space for a suitable representation of the data.
There are countless heuristics for improving or altering
the performance of neural networks; however, in this study
we implement the simplest of these algorithms, i.e.
backpropagation. The neural network is used with 3, 4,
5 and 6 hidden units. A neural network without any hidden
units (the perceptron) is also used.

There is no way to know a priori which algorithm will 
give the best performance. The recommended procedure is 
to run a battery of tests using a good selection of candidate 
algorithms and parameters and measure the generalisation 
ability of each. An effective method for getting an accurate 
measure of generalisation ability is the 10-fold test. This
involves dividing the training data into 10 equal parts; an
algorithm is then trained on 9 of the subsets and tested on the
10th. This procedure is repeated 10 times, holding out a
different subset each time, in order to average the result
over the entire dataset. The model which gave the best
generalisation should then be selected. This procedure is
known as cross validation. Table 2 shows the generalisation
performance of multiple learning algorithms with
different parameters. The SVM has different kernels (linear,
polynomial and RBF) and different soft margins (0.1, 1 and
10). The neural networks have different numbers of hidden
units (free parameters). Each network was trained for 1000
iterations ("epochs").
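
The 10-fold test can be sketched in a few lines. The code below
is a generic illustration, not the software used in this study:
the function names are ours, and a toy one-dimensional threshold
rule stands in for the SVM or neural network.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices once and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, train_fn, predict_fn, k=10):
    """Train on k-1 folds, test on the held-out fold, and return the k success rates."""
    scores = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(samples)) if i not in held]
        model = train_fn([samples[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        correct = sum(predict_fn(model, samples[i]) == labels[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Toy stand-in for a learning algorithm: a threshold halfway between class means.
def train_threshold(xs, ys):
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == -1]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def predict_threshold(threshold, x):
    return 1 if x > threshold else -1

xs = [float(i) for i in range(100)]
ys = [1 if x >= 50 else -1 for x in xs]
scores = cross_validate(xs, ys, train_threshold, predict_threshold)
mean_score = sum(scores) / len(scores)
```

The mean of `scores` estimates generalisation performance, and
their standard deviation gives the kind of uncertainty quoted
alongside each entry of Table 2.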



3.4 Model Performance 

Running a battery of different algorithms showed that an
SVM with a third degree polynomial kernel and a soft margin
of 10 was optimal (see Table 2). The percentages reported
here are the result of 10-fold tests, reported separately over
positive and negative examples. This is useful because the
performance over the whole dataset and the performance
over the positives and negatives sometimes vary considerably.
As the correctly classified examples (true positives and true
negatives) are in all cases distributed over both the positive
and negative data, we can say that in all cases non-trivial
models are found.

3.5 Feature Importance 

Feature importance is the determination of how much
information is given by each input. Determining feature
importance is notoriously difficult (Guyon & Elisseeff 2003). The reason for
this is that in different combinations features have different 
effects, so in reality importance is a combinatorial problem. 
In the case of a matching problem the combinatorial nature 
is emphasised because inputs are of interest in the amount 
that they correlate with other inputs. 

The measuring of the importance of the input is also 
highly tied to the problem of estimating the classification 
model. Adding features can make the estimation more
difficult due to the curse of dimensionality (see footnote 1).
This has the potential to make the addition of useful features
reduce overall classification performance.

The construction of our learning problem leads to some
unusual characteristics. The positive learning vectors consist
of variables from the sparse catalogue and the dense
catalogue joined together. This means that the information from
the sparse catalogue is repeated for every candidate
match in the dense catalogue. Simulations on input
importance have shown that optical parameters alone are often
sufficient to achieve moderate classification. In this case we
obtain a classification rate of 94%. At the outset of this
project it was hoped that there would exist relationships
between the radio and optical parameters which would aid in
classification. This simulation shows that a priori rejection
of objects (stars and galaxies) from the dense catalogue is a
more powerful element of this problem.

A special mismatched dataset was introduced to test 
the hypothesis that radio data could contribute any useful 
information. The dataset consisted of the normal positive 
matches, plus a random sample of radio sources matched 
to distant optical sources. The separation feature was re- 
moved from this simulation. Without radio information on 
this dataset 47% classification was achieved, while when ra- 
dio information was added classification improved to 72%. 
This confirmed that relationships do exist between the radio 
and optical parameters of these galaxies. 
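
One plausible way of building such a mismatched set is sketched
below. This is an assumption on our part, not a description of
the procedure actually used: here a "distant" optical source is
taken to be the counterpart of a different radio source, and all
identifiers are hypothetical.

```python
import random

def mismatched_negatives(matched_pairs, n_samples, seed=0):
    """Pair sampled radio sources with the optical counterpart of a different source."""
    rng = random.Random(seed)
    negatives = []
    for _ in range(n_samples):
        # rng.sample returns two distinct matched pairs, so the radio source
        # is never paired with its own counterpart.
        (radio_a, _), (_, optical_b) = rng.sample(matched_pairs, 2)
        negatives.append((radio_a, optical_b, -1))
    return negatives

# Hypothetical matched pairs (radio identifier, optical identifier).
matched_pairs = [("hi_001", "opt_17"), ("hi_002", "opt_42"), ("hi_003", "opt_08")]
negatives = mismatched_negatives(matched_pairs, n_samples=5)
```

Because the separation feature is dropped, a classifier can only
separate these negatives from the true matches by exploiting
relationships between the radio and optical parameters.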



4 APPLICATION OF THE MODEL TO 
UNMATCHED DATA 

In order to apply our binary classification model to a
HICAT source it must be evaluated against every candidate
match in the HICAT region of uncertainty. This increases
the chance of error from the above estimate because many
model evaluations are required. There is also the chance that
the classifier will find no matches, a single match or multiple
matches. The performance measures given previously were
for binary classification problems. The statistics of false
positives and false negatives are highly related to, but are
not measures of, completeness and efficiency.

Footnote 1: The curse of dimensionality refers to the exponential
increase of hypervolume as a function of dimension. Finding good
models (discriminating functions) that lie in a high dimensional
space is known to be more difficult than finding models in a
lower dimensional space.

Table 2. Performance of algorithms and parameters

Algorithm      Soft margin  Pos data            Neg data            Overall
(kernel)       (c) / hu     % correct           % correct           % correct
SVM linear     0.1          87.47 ±[illegible]  98.70 ±[illegible]  97.17
SVM linear     1            88.94 ±2.43         98.75 ±0.50         97.41
SVM linear     10           88.80 ±2.44         98.80 ±0.25         97.43
SVM poly d=2   0.1          90.04 ±0.31         99.04 ±1.67         97.93
SVM poly d=2   1            94.18 ±1.91         99.44 ±0.20         98.72
SVM poly d=2   10           96.02 ±1.47         99.53 ±0.29         99.05
SVM poly d=3   0.1          94.91 ±1.93         99.46 ±0.27         98.84
SVM poly d=3   1            96.24 ±1.83         99.54 ±0.20         99.09
SVM poly d=3   10           96.69 ±1.26         99.50 ±0.42         99.12 *
SVM RBF γ=1    0.1          89.39 ±2.58         99.21 ±0.27         97.87
SVM RBF γ=1    1            93.66 ±2.47         99.50 ±0.28         98.70
SVM RBF γ=1    10           95.43 ±1.69         99.66 ±0.17         99.08
Perceptron     -            86.81 ±7.73         97.52 ±2.78         96.05
Neural net     hu=3         93.81 ±1.68         95.50 ±1.41         95.27
Neural net     hu=4         94.10 ±2.07         95.46 ±1.38         95.27
Neural net     hu=5         93.50 ±3.59         95.48 ±1.26         95.21
Neural net     hu=6         93.45 ±2.01         95.62 ±1.23         95.32

Note: The errors reported are the standard deviation of the
performance rate found when doing a 10-fold test. The asterisk (*)
denotes the model with the best overall performance. The overall
result takes into account that there is approximately 6.3 times
as much negative data as positive, which places more weight on
classifying negative data correctly. The overall percentage
correct is given by the formula
$R_{\mathrm{overall}} = 0.1365 \times R_{\mathrm{pos}} + 0.8635 \times R_{\mathrm{neg}}$.
In the second column, hu refers to hidden units in a neural network.
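
The weighted overall score defined in the note to Table 2 can be
checked against the best-performing row. The function name below
is ours; the weights and the two input percentages are taken
from the table.

```python
def overall_percent_correct(r_pos, r_neg, w_pos=0.1365, w_neg=0.8635):
    """Weighted mean of the per-class success rates, weights as in the note to Table 2."""
    return w_pos * r_pos + w_neg * r_neg

# Best model in Table 2: SVM, third degree polynomial kernel, soft margin 10.
best = overall_percent_correct(96.69, 99.50)
print(round(best, 2))  # prints 99.12
```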

In order to see how well our model applies to the actual
problem we examine only the unique velocity matches and
test what level of agreement these have with the HOPCAT
catalogue. We take only the 1608 unique velocity matches, out
of 1887 (this has an immediate bearing on the completeness
of the catalogue). A sample of images of newly matching
objects is shown in Fig. 1. Of the 1608 only 9 are misclassified,
indicating that the catalogue has high efficiency.

Accurate estimates of completeness and efficiency are 
not possible in this case for three reasons. The training data, 
and the data to which we apply the model have slightly
different distributions. Our classifier gives a binary output
for each candidate, allowing ambiguous situations such as
multiple matches to exist. A cross validation method (taking
into account unique matches) should be applied to produce
this estimate. Finally, we do not know how accurate the
labels on our training data (HOPCAT) are.

By ignoring cases of multiple matches we are able to
sacrifice completeness for efficiency. We only get a false match
when there are exactly two errors (one false positive and
one false negative). This provides a level of error checking
and means that the classifier is not applied to the difficult
or ambiguous examples.
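
The acceptance rule described here, in which a match is kept only
when exactly one candidate is classified positive, can be written
as a short sketch. The function name and identifiers are
hypothetical.

```python
def unique_match(candidate_ids, predictions):
    """Return the single positively classified candidate, or None if zero or several."""
    positives = [c for c, p in zip(candidate_ids, predictions) if p == 1]
    return positives[0] if len(positives) == 1 else None

accepted = unique_match(["opt_a", "opt_b", "opt_c"], [-1, 1, -1])  # one positive
no_match = unique_match(["opt_a", "opt_b"], [-1, -1])              # no positives
ambiguous = unique_match(["opt_a", "opt_b"], [1, 1])               # two positives
```

Under this rule a wrong counterpart is returned only if the true
match is rejected and exactly one other candidate is accepted,
which is the two-error condition discussed in the text.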




Figure 1. A sample of the new matching objects, panels (a) to (i).

Figure 2. Misclassified objects: (a), (b) are spurious ellipses
marked as matches; (c), (d) and (e) optical measurements of the
velocities of the galaxies show that machine learning chose the
wrong galaxy; (f), (g) and (h) machine learning disagrees with
HOPCAT, but there is insufficient information to establish which
is correct; (i) a bright star at close proximity is chosen as a
match.



It is noteworthy that an error here requires exactly two
errors over all the candidates, and one of these must be on
the match. This provides a form of error checking: if there
are multiple matches then the chance that either the
classifier has failed or that the match is ambiguous is high.
A sample of correctly classified images is shown in Fig. 1
and the 9 incorrectly classified images are shown in Fig. 2.

HOPCAT contained 2221 objects which had insufficient 
information to match. It is this data that we wish to extract 
new information from, by matching it using machine learn- 
ing. 

Of these, 1209 were assigned unique matches by the machine
learning model. The high accuracy on the test set suggests
that a very high proportion of these matches are correct.

A plot of radio flux against optical magnitude of the
old and new points is shown in Fig. 3. The new data points
appear to follow the same trend as the old data points,
although the two distributions are clearly different:
the new points are more likely to be fainter in both
optical and radio flux. This is most likely due to a selection
effect where the training data contains brighter objects. It
appears that the model is successfully extrapolating to fainter
objects than the training data. The authors would like to
stress that the quality of the machine learning model should
be judged on the cross-validation performance, not the good
agreement found here.

Figure 3. Magnitude-flux plot of old and newly obtained data
points (horizontal axis: Bj (mag); legend: HOPCAT data, machine
learning data).

Figure 4. Distribution of integrated flux ($S_{\mathrm{int}}$)

The SuperCOSMOS catalogue goes deeper than HICAT.
The non-linear detection limits on HICAT can be seen
in the distribution of integrated flux ($S_{\mathrm{int}}$, Fig. 4)
and peak flux ($S_p$, Fig. 5). The effect of this threshold is
that objects with $S_{\mathrm{int}} <$ 0.[illegible] Jy km s$^{-1}$
are under-represented in Fig. 3,
while the limit on optical magnitude is low enough to have
negligible effect. This may be responsible for a subtle curve
upwards for the faint end of the spectrum in Fig. 3.

Of the 216 blank fields, 6 (3%) had one or more matches
on them. This gives a rough indication of the frequency of
false positives.
Figure 5. Distribution of peak flux ($S_p$)



5 DISCUSSION 

The matching of catalogues can be framed as a supervised 
learning, pattern classification problem. Despite differences 
between matching and pattern classification the algorithms 
performed remarkably well on this data, showing
performance over 99%. The model we found drew most of its
discriminating power from the optical (dense) catalogue;
however, we were able to show that important relations
exist between the two catalogues.

This method was successful in generating 1209 new 
matches to the HOPCAT Catalogue, bringing the total num- 
ber of matches to 3096 out of 4315. For a significant portion 
of the HICAT sources it is difficult or impossible to find a 
match because there are many optical counterparts; or the 
optical counterparts are obscured by the zone of avoidance. 

The quality of both the source of the training data
(HOPCAT) and the additional counterparts found using
machine learning needs to be verified using high resolution
radio data from the Australia Telescope Compact Array.
Verification of some or all of the data would further validate 
the methods used here. 

This work uncovers a number of new avenues to in- 
vestigate further. There are simple methods that could be 
applied to get a probability that each candidate is a match. 
This would allow assumptions, such as allowing at most one
match, to be built into the classifier.
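
As an illustration of what such a method might look like, the
sketch below converts per-candidate classifier scores into
probabilities that sum to one over the candidates plus an explicit
"no match" option. This is only one possibility and is not a
method evaluated in this paper; the softmax form, the function
name and the example scores are all assumptions.

```python
import math

def match_probabilities(scores, no_match_score=0.0):
    """Softmax over candidate scores plus a final 'no match' option; outputs sum to 1."""
    all_scores = list(scores) + [no_match_score]
    peak = max(all_scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in all_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical classifier scores for three candidates of one radio source.
probs = match_probabilities([2.0, -1.0, 0.5])
best_index = max(range(len(probs)), key=probs.__getitem__)
```

Because the probabilities compete with one another, selecting the
most probable entry returns at most one counterpart per source,
with the last entry standing for "no counterpart".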

The selection effects that could be caused by such a 
method are potentially complex. The newly matched data- 
points are likely to show similarity to points in the training 
data. This opens up two questions. Firstly, if we do not have
any rare objects in the training data, then we are
unlikely to find these objects in the newly matched data.
Moreover if our new data points resemble our old datapoints, 
what aspects of the new distribution of points are simply 
resemblance to the old data, and what aspects are giving us 
new information, not in the original sample? 